The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction
Abstract
In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus, called ALLEGRA, contains press releases automatically gathered from the website of the cantonal administration of Grisons. The texts have been preprocessed and aligned with a state-of-the-art sentence aligner. The corpus is one of the first of its kind and can be of great interest, particularly for the creation of natural language processing resources and tools for Romansh. We illustrate the use of such a trilingual resource for the automatic induction of bilingual lexicons, which is a real challenge for under-represented languages. We induce a German-Romansh bilingual lexicon by phrase alignment and evaluate the resulting entries against a reference lexicon. We then show that using the third language of the corpus, Italian, as a pivot language can improve the precision of the induced lexicon, without loss in terms of quality of the extracted pairs.
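The pivot idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the Italian pivot is applied, so the filter below is an assumption: a German-Romansh pair is kept only if some Italian word is aligned both to the German word and to the Romansh word. The function names `pivot_filter` and `precision`, and the toy word pairs, are hypothetical and not taken from ALLEGRA.

```python
from collections import defaultdict


def pivot_filter(de_rm, de_it, it_rm):
    """Keep (German, Romansh) pairs that are also reachable via an Italian pivot.

    Assumption: each argument is a set of (source, target) word pairs,
    e.g. as extracted from phrase alignments. This is an illustrative
    filter, not necessarily the procedure used by the authors.
    """
    # Index Italian -> set of Romansh translations.
    it_to_rm = defaultdict(set)
    for it, rm in it_rm:
        it_to_rm[it].add(rm)

    # For each German word, collect every Romansh word reachable through Italian.
    via_pivot = defaultdict(set)
    for de, it in de_it:
        via_pivot[de] |= it_to_rm[it]

    return {(de, rm) for de, rm in de_rm if rm in via_pivot[de]}


def precision(induced, reference):
    """Share of induced pairs that also appear in the reference lexicon."""
    if not induced:
        return 0.0
    return len(induced & reference) / len(induced)


# Toy data, invented for illustration only.
de_rm = {("Haus", "chasa"), ("Haus", "aua"), ("Wasser", "aua")}
de_it = {("Haus", "casa"), ("Wasser", "acqua")}
it_rm = {("casa", "chasa"), ("acqua", "aua")}
reference = {("Haus", "chasa"), ("Wasser", "aua")}

filtered = pivot_filter(de_rm, de_it, it_rm)
print(precision(de_rm, reference), precision(filtered, reference))
```

In this toy case the spurious pair ("Haus", "aua") has no supporting Italian path and is dropped, while both correct pairs survive. Real alignments would be noisier, and a score threshold on each alignment would likely be needed as well.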
Similar resources
Lexica and corpora for speech-to-speech translation: a trilingual approach
Creation of lexica and corpora for Catalan, Spanish and US-English is described. A lexicon is being created for speech recognition and synthesis including relevant information. The lexicon contains 50K common words selected to achieve a wide coverage on the chosen domains, and 50K additional entries including special application words, and proper nouns. Furthermore, a large trilingual spontaneo...
Deliverable D4.5 RWTH Aachen
(for dissemination) We describe the experimental results using the baseline speech-to-speech translation systems created in D4.3 and compare them to an enhanced translation system taking different language resources into account. Experiments were performed on the trilingual corpus (English, Spanish, Catalan) built within the project in WP5. This corpus consists of spontaneous dialogues in the d...
A viewing and processing tool for the analysis of a comparable corpus of Kiranti mythology
This presentation describes a trilingual corpus of three endangered languages of the Kiranti group (Tibeto-Burman family) from Eastern Nepal. The languages, which are exclusively oral, share a rich mythology, and it is thus possible to build a corpus of the same native narrative material in the three languages. The segments of similar semantic content are tagged with a "similarity" label to ide...
How textbooks (and learners) get it wrong: A corpus study of modal auxiliary verbs
Many elements contribute to the relative difficulty in acquiring specific aspects of English as a foreign language (Goldschneider & DeKeyser, 2001). Modal auxiliary verbs (e.g. could, might) are examples of a structure that is difficult for many learners. Not only are they particularly complex semantically, but especially in the Malaysian context ...
Metaphorical Conceptualization of SPORT Through TERRITORY as a Vehicle
WAR as a vehicle and Sport Is War as a conceptual metaphor (CM) seem inadequate to account metaphorically for SPORT. To cater for an inclusive vehicle/CM, we selected WIN and LOSS lexicon from the news coverage of Brazil’s football team loss to Germany and tested them through the Corpus of Contemporary American English. Then, the data were studied through the 3 stages of metaphor research. In t...
Publication date: 2012